Transformation-Based Learning for Automatic Translation from HTML to XML

Authors

  • James R. Curran
  • Raymond K. Wong
Abstract

Format tags implicitly represent content information in the same ambiguous, context-dependent manner that words represent semantics in natural language. Translation from format to content markup shares many characteristics with tagging and parsing tasks in computational linguistics. The transformation-based learning (TBL) paradigm has recently been applied to numerous computational linguistics tasks with considerable success. We present a transformation-based translator which automatically learns to translate semistructured HTML documents formatted with a particular style to XML using a small set of training examples.
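The abstract does not describe the translator's internals, so the following is only a minimal sketch of the general TBL idea in the style of Brill's tagger, not the authors' system. It assumes a flat sequence of HTML format tags, a hypothetical set of content labels (`title`, `author`, `term`, `body`), and a single hypothetical rule template: "change label `old` to `new` when the preceding format tag is `prev_tag`". Each tag first receives its most frequent training label; the learner then greedily adds whichever rule fixes the most remaining errors.

```python
from collections import Counter


def apply_rule(rule, tags, labels):
    """Relabel position i from `old` to `new` when the preceding tag is `prev_tag`."""
    old, new, prev_tag = rule
    out = list(labels)
    for i in range(1, len(tags)):
        if labels[i] == old and tags[i - 1] == prev_tag:
            out[i] = new
    return out


def count_correct(labels, gold):
    return sum(l == g for l, g in zip(labels, gold))


def learn(corpus, max_rules=10):
    """corpus: list of (format_tags, gold_labels) pairs. Returns (baseline, rules)."""
    # Initial state: label each format tag with its most frequent gold label.
    per_tag = {}
    for tags, gold in corpus:
        for t, g in zip(tags, gold):
            per_tag.setdefault(t, Counter())[g] += 1
    baseline = {t: c.most_common(1)[0][0] for t, c in per_tag.items()}
    current = [[baseline[t] for t in tags] for tags, _ in corpus]

    rules = []
    for _ in range(max_rules):
        # Instantiate the rule template at every position that is still wrong.
        candidates = {
            (labels[i], gold[i], tags[i - 1])
            for (tags, gold), labels in zip(corpus, current)
            for i in range(1, len(tags))
            if labels[i] != gold[i]
        }
        # Keep the rule with the best net gain (fixes minus newly broken labels).
        best, best_gain = None, 0
        for rule in sorted(candidates):
            gain = sum(
                count_correct(apply_rule(rule, tags, labels), gold)
                - count_correct(labels, gold)
                for (tags, gold), labels in zip(corpus, current)
            )
            if gain > best_gain:
                best, best_gain = rule, gain
        if best is None:
            break  # no remaining rule improves the training data
        rules.append(best)
        current = [
            apply_rule(best, tags, labels)
            for (tags, _), labels in zip(corpus, current)
        ]
    return baseline, rules


def translate(tags, baseline, rules, default="text"):
    """Label unseen tags: baseline first, then the learned rules in order."""
    labels = [baseline.get(t, default) for t in tags]
    for rule in rules:
        labels = apply_rule(rule, tags, labels)
    return labels


# Toy training data (hypothetical): <b> usually marks a term, but marks the
# author when it directly follows the <h1> title.
corpus = [
    (["h1", "b", "p", "b", "p"], ["title", "author", "body", "term", "body"]),
    (["h1", "b", "p", "b", "b"], ["title", "author", "body", "term", "term"]),
]
baseline, rules = learn(corpus)
print(rules)
print(translate(["h1", "b", "b"], baseline, rules))
```

On this toy data the baseline maps `b` to `term`, and one learned rule then corrects the `b` that follows `h1`. A translator for real pages would need richer rule templates and tree-structured input rather than a flat tag sequence.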


Similar resources

Probabilistic Model for Structured Document Mapping Application to Automatic HTML to XML Conversion

We address the problem of automatically learning to map heterogeneous semi-structured documents onto a mediated target XML schema. We adopt a machine learning approach where the mapping between input and target documents is learned from a training corpus of documents. We first introduce a general stochastic model of semi-structured document generation and transformation. This model relies on t...


A proposal of an automatic formatting method for transforming XML data

PPX(Pretty Printer for XML) is a query language that offers a concise description method of formatting the XML data into HTML. In this paper, we propose a simple specification of formatting method that is a combination description of automatic layout operators and variables in the layout expression of the GENERATE clause of PPX. This method can automatically format irregular XML data included i...


Automatic Translation of Court Judgments

This document presents an experiment in the automatic translation of Canadian Court judgments from English to French and from French to English. We show that although the language used in this type of legal text is complex and specialized, an SMT system can produce intelligible and useful translations, provided that the system can be trained on a vast amount of legal text. We also describe the ...


From XML to Semantic Web

The present web exists in HTML and XML formats for people to browse. Recently there has been a trend towards the semantic web, where information can be processed and understood by agents. Most current research focuses on translation from HTML to the semantic web, but seldom on XML. In this paper, we design a method to translate XML to the semantic web. It is known that onto...


Literature Survey XML-based Transformation Engines

Translation has been an issue for humans since the dawn of communication. The advent of computers has neither lessened the need nor trivialized the task of translating. If anything, with the creation of incompatible computer protocols, the number of things needing translation, another word for transformation, has grown. The seemingly universal extensible markup language (XML) has been touted as...




Publication date: 1999